Automatic processing of code-mixed social media content

Author: Barman Utsab
Publication venue: Dublin City University. ADAPT
Publication date: 01/04/2019

Code-mixing, or language-mixing, is a linguistic phenomenon in which multiple languages are mixed together during conversation. Standard natural language processing (NLP) tools such as part-of-speech (POS) taggers and parsers perform poorly on such content because they are generally trained on monolingual data. Thus there is a need for code-mixed NLP. This research focuses on creating a code-mixed corpus in English-Hindi-Bengali and using it to develop a word-level language identifier and a POS tagger for such code-mixed content. The first target of this research is word-level language identification. A data set of romanised and code-mixed content written in English, Hindi and Bengali was created and annotated. Word-level language identification (LID) was performed on this data using dictionaries and machine learning techniques. Among a dictionary-based system, a character-n-gram-based linear model, a character-n-gram-based first-order Conditional Random Field (CRF) and a recurrent neural network in the form of a Long Short-Term Memory (LSTM) network that considers words as well as characters, the LSTM outperformed the other methods. We also took part in the First Workshop on Computational Approaches to Code Switching at EMNLP 2014, where we achieved the highest token-level accuracy in the word-level language identification task for Nepali-English. The second target of this research is part-of-speech (POS) tagging. POS tagging methods for code-mixed data (e.g. pipeline and stacked systems and LSTM-based neural models) were implemented; among them, the neural approach outperformed the others. Further, we investigate building a joint model to perform language identification and POS tagging jointly. We compare a factorial CRF (FCRF) based joint model with three LSTM-based multi-task models for word-level language identification and POS tagging. The neural models achieve good accuracy in language identification and POS tagging, outperforming the FCRF approach. Furthermore, we found that a multi-task learning approach is better than performing each task (e.g. language identification or POS tagging) individually with a neural approach. Comparison of the three neural approaches revealed that, without using task-specific recurrent layers, it is possible to achieve good accuracy by careful handling of the output layers for the two tasks, LID and POS tagging.

Irish Universities

DCU Online Research Access Service

NextGen AML: distributed deep learning based language technologies to augment anti money laundering Investigation

Author: Barman Utsab
Burgin Edward
Du Jinhua
Han Jingguang
Hayes Jer
Wan Dadong
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2018

Most current anti-money laundering (AML) systems use handcrafted rules and rely heavily on existing structured databases. They cannot effectively and efficiently identify hidden and complex money laundering (ML) activities, especially those with dynamic and time-varying characteristics, which results in a high percentage of false positives. Analysts are therefore engaged for further investigation, which significantly increases human capital cost and processing time. To alleviate these issues, this paper presents a novel framework for next-generation AML that applies and visualizes deep learning-driven natural language processing (NLP) technologies in a distributed and scalable manner to augment AML monitoring and investigation. The proposed distributed framework performs news and tweet sentiment analysis, entity recognition, relation extraction, entity linking and link analysis on different data sources (e.g. news articles and tweets) to provide additional evidence to human investigators for final decision-making. Each NLP module is evaluated on a task-specific data set, and the overall experiments are performed on synthetic and real-world datasets. Feedback from AML practitioners suggests that our system can reduce time and cost by approximately 30% compared to their previous manual approaches to AML investigation.

Crossref

Irish Universities

DCU Online Research Access Service

DCU: aspect-based polarity classification for SemEval task 4

Author: Arora Piyush
Barman Utsab
Bogdanova Dasha
Cortes Santiago
Foster Jennifer
Tounsi Lamia
Wagner Joachim
Publication venue: Association for Computational Linguistics and Dublin City University
Publication date: 01/01/2014

We describe the work carried out by DCU on the Aspect Based Sentiment Analysis task at SemEval 2014. Our team submitted one constrained run for the restaurant domain and one for the laptop domain for sub-task B (aspect term polarity prediction), ranking highest out of 36 systems on the restaurant test set and joint highest out of 32 systems on the laptop test set.

CiteSeerX

Crossref

Irish Universities

DCU Online Research Access Service

Hexadecane mineralization and denitrification in two diesel fuel-contaminated soils

Author: Barman Utsab
Foster Jennifer
Wagner Joachim
Publication date: 01/01/2000

NRC publication: Ye

NRC Publications Archive

Crossref

Irish Universities

DCU Online Research Access Service

DCU-UVT: Word-Level Language Classification with Code-Mixed Data

Author: Barman Utsab
Chrupala Grzegorz
Foster Jennifer
Wagner Joachim
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2014

This paper describes the DCU-UVT team's participation in the Language Identification in Code-Switched Data shared task at the Workshop on Computational Approaches to Code Switching. Word-level classification experiments were carried out using a simple dictionary-based method, linear-kernel support vector machines (SVMs) with and without contextual clues, and a k-nearest neighbour approach. Based on these experiments, we select our SVM-based system with contextual clues as our final system and present results for the Nepali-English and Spanish-English datasets.

CiteSeerX

Crossref

DCU Online Research Access Service

Tilburg University Repository

Suomi-skenaariot - työkalu strategiseen ajatteluun [Finland scenarios: a tool for strategic thinking]

Author: Barman Utsab
Das Amitava
Foster Jennifer
Wagner Joachim
Publication venue: Turku : Tulevaisuuden tutkimuksen seura, 1986-
Publication date: 01/01/1997

In social media communication, multilingual speakers often switch between languages, and, in such an environment, automatic language identification becomes both a necessary and challenging task. In this paper, we describe our work in progress on the problem of automatic language identification for the language of social media. We describe a new dataset that we are in the process of creating, which contains Facebook posts and comments that exhibit code mixing between Bengali, English and Hindi. We also present some preliminary word-level language identification experiments using this dataset. Different techniques are employed, including a simple unsupervised dictionary-based approach, supervised word-level classification with and without contextual clues, and sequence labelling using Conditional Random Fields. We find that the dictionary-based approach is surpassed by supervised classification and sequence labelling, and that it is important to take contextual clues into consideration.

CiteSeerX

Crossref

DCU Online Research Access Service

National Library of Finland DSpace Services

NextGen AML: distributed deep learning based language technologies to augment anti money laundering Investigation

Author: Barman Utsab
Burgin Edward
Du Jinhua
Han Jingguang
Hayes Jer
Wan Dadong
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/07/2018

Irish Universities

Natural Language Processing for Social Media, Second Edition

Author: Abdul-Mageed Muhammad
Akhtar Md Shad
Al-Gaphari Galeb H.
Ali Tanveer
Allan James
Artstein Ron
Arunachalam Ravi
Atefeh Farzindar
Avudaiappan Neela
Baccianella Stefano
Bakr Hitham Abo
Balasubramanyan Ramnath
Baldwin Timothy
Balikas Georgios
Barman Utsab
Baroni Marco
Becker Hila
Bellaachia Abdelghani
Benson Edward
Benton Adrian
Berger Adam L.
Bergsma Shane
Bermingham Adam
Beverungen Gary
Bing Li
Bizer Christian
Blei David M.
Bollen Johan
Bollen Jonah
Bontcheva Kalina
Boujelbane Rahma
Brantingham Richard
Brew Chris
Burfoot Clinton
Caragea Cornelia
Carletta Jean
Carter Simon
Celli Fabio
Chen Hailiang
Chen Zheng
Chilet Jorge Ale
Choudhury Munmun De
Colbaugh Richard
Coppersmith Glen
Cordeiro Mário
Cucerzan S.
Cunningham Hamish
Daumé Hal
Davidov Dmitry
Debnath Pragna
Delort Jean-Yves
Demir Seniz
Derczynski Leon
Diab Mona
Diana Inkpen
Dlugolinský Stefan
Dodds Peter Sheridan
Dredze Mark
Duan Yajuan
Dunning Ted
Eisenstein Jacob
Ekman Paul
Elfardy Heba
Farzindar Atefeh
Ferragina Paolo
Fokkens Antske
Ford Dominey Peter
Foster George
Foster Jennifer
Friedman Jerome H.
Gella Spandana
Ghazi Diman
Gil Gonzalo Blazquez
González-Ibáñez Roberto
Gotti Fabrizio
Guo Weiwei
Habash Nizar
Han Bo
Harabagiu Sanda
Harrison Phillip G.
He Hangfeng
Hecht Brent
Henrich Verena
Heravi Bahareh Rahmanzadeh
Hoffart Johannes
Holzman Lars E.
Horsmann Tobias
Howes Christine
Hsieh Wen-Tai
Hu Meishan
Huang Fei
Imran Muhammad
Inouye David
Izard Caroll E.
Jehl Laura
Jehl Laura Elisabeth
Jin Xiaotian
Judd Joel
Kashyap Ranjitha
Khabiri Elham
Khan Mohammad
Kim Sang Erik Tjong
Kokkos Athanasios
Lafferty John D.
Lampos Vasileios
Leonard
Lewis Will
Li Jiwei
Limsopatham Nut
Lin Hui
Ling Wang
Liu Bing
Liu Ji
Liu Wendy
Liu Xiaohua
Llewellyn Clare
Long Rui
Lui Marco
Lukin Stephanie
Lösch Uta
Ma Jing
Mao Huina
Marchetti-Bowick Micol
Marcus Mitchell P.
Margaret
Maynard Diana
Metzler Donald
Mishne Gilad
Moghaddam Samaneh
Mohammad Saif M.
Mohammady Ehsan
Mohay George
Moro Andrea
Mubarak Hamdy
Munro Robert
Neviarouskaya Alena
Nguyen Dong
Nikfarjam Azadeh
O'Connor Brendan
Oberlander Jon
Ovrelid Lilja
Owoputi Olutobi
Pajzs Julia
Pak Alexander
Paranjpe Deepa
Park Minsu
Paul Michael
Peng Fuchun
Peng Nanyun
Pennebaker James W.
Persing Isaac
Petrovic Sasha
Pla Ferran
Plutchik Robert
Poese Ingmar
Popescu Adrian
Popescu Ana-Maria
Porshnev Alexander
Power Robert
Prapula G.
Ramage Daniel
Rao Delip
Razmara Majid
Riloff Ellen
Ritter Alan
Roller Stephen
Rowe Matthew
Rubin Victoria
Sawaf Hassan
Schler Jonathan
Seddah Djamé
Shaalan Khaled
Shamma D. A.
Sharifi Beaux
Shickel Benjamin
Simsek M. U.
Sinha Priyanka
Sokolova Marina
Strapparava Carlo
Sul Hong Keel
Titov Ivan
Tkachenko Alexander
Tromp Erik
Uzuner Özlem
Vallor Shannon
Verma Sudha
Wan Stephen
Wang Na
Wang Pidong
Washington
Weerkamp Wouter
Weng Jianshu
William
Wing Benjamin
Witten Ian
Wu Wei
Xie Wei
Yan Rui
Yang Steve Y.
Zbib Rabih
Zesch Torsten
Zhao Wayne Xin
Zhou Liang
Zhou Ning
Publication venue: Morgan & Claypool Publishers LLC

Crossref